[conflicted] Will prefer haven::is.labelled over any other package.
[conflicted] Will prefer dplyr::filter over any other package.
Data Management II
Agenda
- Manipulating Dataframes
- Joining Dataframes
joining data missing data? & outliers - dates? - lists
Learning objectives
By the end of the lecture, you will be able to …
- Reshape data between wide and long formats
- Reorder columns and rows
- Combine and expand dataframes
Code-along 04
Download and open code-along-04.qmd
GSS Panel Data: Download
https://gss.norc.org/get-the-data/stata
Manipulating Dataframes
Selection helpers
Match variables according to a given pattern.
starts_with(): Starts with an exact prefix.ends_with(): Ends with an exact suffix.contains(): Contains a literal string.- …
my_data <- data |>
select(yearid, wtssnr_2,
starts_with("age_"),
starts_with("family16_"),
starts_with("socfrend_"),
starts_with("childs_")) . . .
head() & tail()
Look at the first few column names and first few rows.
head(my_data, n = 5)# A tibble: 5 × 14
yearid wtssnr_2 age_1a age_1b age_2 family16_1a family16_1b family16_2
<dbl> <dbl> <fct> <fct> <fct> <fct> <fct> <fct>
1 20160001 1.44 47 <NA> 51 both own mother … <NA> not avail…
2 20160002 0.722 61 <NA> 65 both own mother … <NA> not avail…
3 20160003 NA 72 <NA> <NA> both own mother … <NA> iap
4 20160004 2.89 43 <NA> 47 mother only <NA> not avail…
5 20160005 NA 55 <NA> <NA> both own mother … <NA> iap
# ℹ 6 more variables: socfrend_1a <fct>, socfrend_1b <fct>, socfrend_2 <fct>,
# childs_1a <fct>, childs_1b <fct>, childs_2 <fct>
. . .
Look at the first few column names and last few rows.
tail(my_data, n = 5)# A tibble: 5 × 14
yearid wtssnr_2 age_1a age_1b age_2 family16_1a family16_1b family16_2
<dbl> <dbl> <fct> <fct> <fct> <fct> <fct> <fct>
1 20182344 NA <NA> 37 <NA> <NA> mother and stepf… iap
2 20182345 0.995 <NA> 75 77 <NA> both own mother … not avail…
3 20182346 0.995 <NA> 67 70 <NA> both own mother … not avail…
4 20182347 NA <NA> 72 <NA> <NA> both own mother … iap
5 20182348 NA <NA> 79 <NA> <NA> both own mother … iap
# ℹ 6 more variables: socfrend_1a <fct>, socfrend_1b <fct>, socfrend_2 <fct>,
# childs_1a <fct>, childs_1b <fct>, childs_2 <fct>
Reminder: Tidy data
. . .
This data is NOT tidy!
Some column names include values of a variable (survey year).
# A tibble: 15,645 × 7
yearid wtssnr_2 panel age family16 socfrend childs
<dbl> <dbl> <chr> <fct> <fct> <fct> <fct>
1 20160001 1.44 1a 47 both own mother and father several tim… 3
2 20160001 1.44 1b <NA> <NA> <NA> <NA>
3 20160001 1.44 2 51 not available for this year several tim… 3
4 20160002 0.722 1a 61 both own mother and father several tim… 0
5 20160002 0.722 1b <NA> <NA> <NA> <NA>
6 20160002 0.722 2 65 not available for this year about once … 0
7 20160003 NA 1a 72 both own mother and father <NA> 2
8 20160003 NA 1b <NA> <NA> <NA> <NA>
9 20160003 NA 2 <NA> iap <NA> <NA>
10 20160004 2.89 1a 43 mother only once or twi… 4
# ℹ 15,635 more rows
This data is tidy!
Each variable in its own column, and each observation in its own row.
pivot_longer()
my_data_long <- my_data |>
pivot_longer(
1 cols = 3:14,
2 names_to = "variable",
3 values_to = "value")
head(my_data_long, n = 5)- 1
-
cols =specifies which columns you want to turn into one - 2
-
names_to =defines the name of the new variable containing the current variable names - 3
-
values_to=defines the name of the new variable that takes in the values of the variables
# A tibble: 5 × 4
yearid wtssnr_2 variable value
<dbl> <dbl> <chr> <fct>
1 20160001 1.44 age_1a 47
2 20160001 1.44 age_1b <NA>
3 20160001 1.44 age_2 51
4 20160001 1.44 family16_1a both own mother and father
5 20160001 1.44 family16_1b <NA>
. . .
This is also not tidy data!
The variable column contains the variable names & the panel ID.
separate()
my_data_long <- my_data |>
pivot_longer(
cols = c(-yearid, -wtssnr_2),
names_to = "variable",
values_to = "value") |>
1 separate_wider_delim(variable,
2 delim = "_",
3 names = c("variable", "panel"))
head(my_data_long, n = 5)- 1
-
Split the column
variableinto two using a delimiter - 2
- A string giving the delimiter between values
- 3
-
namesspecifies the two new column names
# A tibble: 5 × 5
yearid wtssnr_2 variable panel value
<dbl> <dbl> <chr> <chr> <fct>
1 20160001 1.44 age 1a 47
2 20160001 1.44 age 1b <NA>
3 20160001 1.44 age 2 51
4 20160001 1.44 family16 1a both own mother and father
5 20160001 1.44 family16 1b <NA>
. . .
This is still not tidy data!
The value variable is a mix of different types of values
pivot_wider()
my_data <- my_data |> # overwriting my_data
pivot_longer(
cols = c(-yearid, -wtssnr_2),
names_to = "variable",
values_to = "value") |>
separate_wider_delim(variable,
delim = "_",
names = c("variable", "panel")) |>
1 pivot_wider(
2 names_from = variable,
3 values_from = value)
head(my_data, n = 5)- 1
- Increasing the number of columns and decreasing the number of rows
- 2
- Which column to get the name of the output columns
- 3
- Which column to get the cell values from
# A tibble: 5 × 7
yearid wtssnr_2 panel age family16 socfrend childs
<dbl> <dbl> <chr> <fct> <fct> <fct> <fct>
1 20160001 1.44 1a 47 both own mother and father several time… 3
2 20160001 1.44 1b <NA> <NA> <NA> <NA>
3 20160001 1.44 2 51 not available for this year several time… 3
4 20160002 0.722 1a 61 both own mother and father several time… 0
5 20160002 0.722 1b <NA> <NA> <NA> <NA>
Recode the reshaped variable
my_data <- my_data |>
mutate(panel = case_when(
panel == "1a" ~ 2016,
panel == "1b" ~ 2018,
panel == "2" ~ 2020,
TRUE ~ NA_integer_))
head(my_data, n = 3)# A tibble: 3 × 7
yearid wtssnr_2 panel age family16 socfrend childs
<dbl> <dbl> <dbl> <fct> <fct> <fct> <fct>
1 20160001 1.44 2016 47 both own mother and father several time… 3
2 20160001 1.44 2018 <NA> <NA> <NA> <NA>
3 20160001 1.44 2020 51 not available for this year several time… 3
. . .
relocate()
my_data <- my_data |>
relocate(panel)
head(my_data, n = 2)# A tibble: 2 × 7
panel yearid wtssnr_2 age family16 socfrend childs
<dbl> <dbl> <dbl> <fct> <fct> <fct> <fct>
1 2016 20160001 1.44 47 both own mother and father several times… 3
2 2018 20160001 1.44 <NA> <NA> <NA> <NA>
. . .
my_data <- my_data |>
relocate(panel, .after = yearid)
head(my_data, n = 2)# A tibble: 2 × 7
yearid panel wtssnr_2 age family16 socfrend childs
<dbl> <dbl> <dbl> <fct> <fct> <fct> <fct>
1 20160001 2016 1.44 47 both own mother and father several times… 3
2 20160001 2018 1.44 <NA> <NA> <NA> <NA>
arrange()
my_data |>
arrange(panel) |>
select(yearid, panel, age, family16)# A tibble: 15,645 × 4
yearid panel age family16
<dbl> <dbl> <fct> <fct>
1 20160001 2016 47 both own mother and father
2 20160002 2016 61 both own mother and father
3 20160003 2016 72 both own mother and father
4 20160004 2016 43 mother only
5 20160005 2016 55 both own mother and father
6 20160006 2016 53 other
7 20160007 2016 50 both own mother and father
8 20160008 2016 23 both own mother and father
9 20160009 2016 45 both own mother and father
10 20160010 2016 71 both own mother and father
# ℹ 15,635 more rows
. . .
my_data |>
arrange(desc(panel)) |>
select(yearid, panel, age, family16)# A tibble: 15,645 × 4
yearid panel age family16
<dbl> <dbl> <fct> <fct>
1 20160001 2020 51 not available for this year
2 20160002 2020 65 not available for this year
3 20160003 2020 <NA> iap
4 20160004 2020 47 not available for this year
5 20160005 2020 <NA> iap
6 20160006 2020 <NA> iap
7 20160007 2020 <NA> iap
8 20160008 2020 27 not available for this year
9 20160009 2020 49 not available for this year
10 20160010 2020 <NA> iap
# ℹ 15,635 more rows
Joining Dataframes
Think Like a Statistician
Are married people above or below average in internet use or income? Does it vary by survey year?
How do we find out?
Answering this research question takes a few steps. But, the first step is to create a dataframe with all the necessary information.
Can you reproduce this table?
# A tibble: 15,645 × 7
yearid panel marital wwwhr realrinc avg_www avg_inc
<dbl> <dbl> <fct> <dbl+lbl> <dbl+lbl> <dbl> <dbl>
1 20160001 2016 married 15 164382. 14.2 24929.
2 20160001 2018 <NA> NA(i) NA(i) 14.8 26103.
3 20160001 2020 married 20 147659. 15.0 28222.
4 20160002 2016 never married 5 [never married] 25740 14.2 24929.
5 20160002 2018 <NA> NA(i) NA(i) 14.8 26103.
6 20160002 2020 never married 10 23980 15.0 28222.
7 20160003 2016 married NA(i) NA(i) 14.2 24929.
8 20160003 2018 <NA> NA(i) NA(i) 14.8 26103.
9 20160003 2020 <NA> NA(i) NA(i) 15.0 28222.
10 20160004 2016 married 7 5265 14.2 24929.
# ℹ 15,635 more rows
What’s your conclusion to our research question?
# Income status
ctable(think_married$income_status, think_married$panel,
prop = "c",
format = "p",
useNA = "no",
headings = FALSE
)
--------------- ------- -------------- -------------- -------------- ---------------
panel 2016 2018 2020 Total
income_status
Above Average 328 ( 43.8%) 235 ( 38.3%) 247 ( 45.6%) 810 ( 42.5%)
Below Average 421 ( 56.2%) 379 ( 61.7%) 295 ( 54.4%) 1095 ( 57.5%)
Total 749 (100.0%) 614 (100.0%) 542 (100.0%) 1905 (100.0%)
--------------- ------- -------------- -------------- -------------- ---------------
# Internet status
ctable(think_married$internet_status, think_married$panel,
prop = "c",
format = "p",
useNA = "no",
headings = FALSE
)
----------------- ------- -------------- -------------- -------------- ---------------
panel 2016 2018 2020 Total
internet_status
Above Average 207 ( 29.4%) 180 ( 29.4%) 165 ( 33.4%) 552 ( 30.5%)
Below Average 496 ( 70.6%) 433 ( 70.6%) 329 ( 66.6%) 1258 ( 69.5%)
Total 703 (100.0%) 613 (100.0%) 494 (100.0%) 1810 (100.0%)
----------------- ------- -------------- -------------- -------------- ---------------